DSC540 Project ML

Gerardo Palacios

Data Description

For my final project I will be working with a heart failure dataset retrieved from Kaggle (https://www.kaggle.com/fedesoriano/heart-failure-prediction). It is a small dataset containing 918 observations, 11 features, and a binary target class.

It is a very clean dataset with no missing values.

The features are shown below. The dataset is slightly imbalanced, with the target class split 55%/44% between having and not having heart disease. The dataset consists mainly of male patients with varying vital measurements and accompanying cardiac symptoms.

| Feature | Description | Mean / Most Frequent | Standard Deviation / Ratio |
|---|---|---|---|
| HeartDisease | Target class [1: heart disease, 0: normal] | Heart Disease | Heart Disease: 55%, Normal: 44% |
| Age | Age of the patient | 53.5 | 9.43 |
| RestingBP | Resting blood pressure [mm Hg] | 132.39 | 18.51 |
| Cholesterol | Serum cholesterol | 198.79 | 109.38 |
| FastingBS | Fasting blood sugar [1: FastingBS > 120 mg/dl, 0: otherwise] | Normal (0) | 0: 76%, 1: 23% |
| MaxHR | Maximum heart rate achieved | 136.80 | 25.46 |
| Oldpeak | Numeric value of ST depression measured during exercise | 0.88 | 1.06 |
| Sex | Sex of the patient | Male | M: 78%, F: 21% |
| ChestPainType | Chest pain type [TA: typical angina, ATA: atypical angina, NAP: non-anginal pain, ASY: asymptomatic] | Asymptomatic | TA: 5%, ATA: 18%, NAP: 22%, ASY: 54% |
| RestingECG | Resting ECG results [Normal; ST: ST-T wave abnormality (T-wave inversions and/or ST elevation or depression > 0.05 mV); LVH: probable or definite left ventricular hypertrophy by Estes' criteria] | Normal | Normal: 60%, ST: 19%, LVH: 20% |
| ExerciseAngina | Whether the patient has exercise-induced angina [Y: yes, N: no] | No | Yes: 40%, No: 59% |
| ST_Slope | Slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping] | Flat | Up: 43%, Flat: 50%, Down: 6% |

The distributions of the features are shown below. Age is the only variable that appears nearly normal, while the remaining four numerical variables are skewed. All the categorical variables are imbalanced.

The following are cross-tabulations between the categorical variables and heart disease. Most do not show an obvious pattern, but a few stand out. The most striking is the cross-tabulation between heart disease and chest pain type: a large majority of the patients with heart disease were asymptomatic. There also appears to be a relationship between heart disease and exercise-induced angina, and between heart disease and a flat ST slope, although both may be partly due to the imbalanced nature of those features.

However, despite the seemingly unremarkable cross-tabulations, the chi-squared tests between each categorical variable and the target class all reject the null hypothesis of independence. As shown below, the p-values support, at the 99% confidence level, that these features are associated with heart disease.

To measure the relationship between the continuous variables and the target class, I computed the point-biserial correlation. As shown below, all the continuous features have a moderate association with heart disease, and all are statistically significant.

Pre-Processing

In order to prepare the data for classification I need to perform a few tasks: encode the categorical variables, then standardize the dataset. Encoding can be done in two ways. The first is one-hot encoding, which converts each level of a categorical variable into its own binary (0/1) feature; this produces a higher-dimensional dataset. The second is Weight of Evidence (WoE), which measures the predictive power of an independent categorical variable in relation to the dependent categorical variable, transforming the categorical feature into a continuous one. In this project I experiment with both methodologies and compare models built on each encoding.
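The two encodings can be sketched as follows. The DataFrame below is a toy stand-in for the heart dataset, and the `woe_encode` helper (including its smoothing term, added to avoid log(0)) is my own illustrative implementation, not a library function:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the heart dataset (hypothetical values)
df = pd.DataFrame({
    "ChestPainType": ["ASY", "ATA", "ASY", "NAP", "ASY", "TA"],
    "HeartDisease":  [1, 0, 1, 0, 1, 0],
})

# One-hot encoding: one binary column per category level
one_hot = pd.get_dummies(df["ChestPainType"], prefix="ChestPainType")

# Weight of Evidence: ln(% of events / % of non-events) per category.
def woe_encode(series, target, smoothing=0.5):
    events = target.groupby(series).sum() + smoothing
    non_events = (1 - target).groupby(series).sum() + smoothing
    woe = np.log((events / target.sum()) / (non_events / (1 - target).sum()))
    return series.map(woe)

df["ChestPainType_WoE"] = woe_encode(df["ChestPainType"], df["HeartDisease"])
```

Note how WoE keeps a single continuous column per categorical feature, while one-hot adds one column per level.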

Next I split the data into training and test sets and scale the features with a Min-Max transformation. This is done for both encoded datasets.

Modeling

Finally, I can begin modeling. To compare across models I use accuracy and AUC as the performance metrics. To be thorough, in addition to evaluating on the held-out test set, I also compare models using 5-fold cross-validation over the entire dataset.
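The evaluation scheme can be sketched with scikit-learn's `cross_val_score`, here on synthetic data standing in for the heart dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the heart dataset (11 features)
X, y = make_classification(n_samples=200, n_features=11, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Same two metrics used throughout: accuracy and AUC, each via 5-fold CV
acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"5-fold accuracy: {acc.mean():.3f}, 5-fold AUC: {auc.mean():.3f}")
```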

Logistic Regression

The logistic regression models performed fairly well and serve as the baseline. Logistic regression was trained on the one-hot encoded dataset and the WoE dataset (Data1 and Data2, respectively), and a randomized search CV was used for hyper-parameter tuning. As the following results show, all the models scored above 84% test accuracy and had a mean 5-fold AUC of at least 0.90. The iterations show little variance and a slight bias towards the heart disease class, as shown by the confusion matrix and the probability distributions.
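A sketch of the tuning setup, assuming scikit-learn's `RandomizedSearchCV` over the regularization strength `C`; the actual search space used in the notebook may differ:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=11, random_state=0)

# Hypothetical search space; the notebook's actual grid may differ
param_dist = {"C": loguniform(1e-3, 1e2)}
search = RandomizedSearchCV(LogisticRegression(max_iter=1000), param_dist,
                            n_iter=20, cv=5, scoring="roc_auc", random_state=0)
search.fit(X, y)
```

`search.best_estimator_` is then evaluated on the held-out test set like any other model.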

Logistic Regression Feature Selection - Sequential Feature Selector

Feature selection was also conducted by way of sequential feature selection. This method adds, one at a time, the feature that contributes the most predictive power until there is no further improvement. As shown below, 5 of the 11 features were selected (Age, Sex, ChestPainType, ExerciseAngina, and ST_Slope) as the most significant predictors of heart disease. In fact, a logistic regression on these features alone achieves metrics comparable to using all the available features.
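The procedure corresponds to scikit-learn's `SequentialFeatureSelector` in forward mode; a minimal sketch on synthetic data (the real run selected Age, Sex, ChestPainType, ExerciseAngina, and ST_Slope):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=11,
                           n_informative=5, random_state=0)

# Forward selection: greedily add the feature that improves the CV score most
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X, y)
selected = sfs.get_support(indices=True)  # indices of the retained features
```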

Random Forest Classifier

The random forest models performed just as well as the logistic regression models; the test accuracy and AUC scores are comparable. One advantage over logistic regression is a more balanced ratio between the two predicted classes: it does not favor one prediction over the other, suggesting a more generalizable model. This is further supported by the two opposing peaks in the probability distribution.

Random Forest Feature Selection - Sequential Feature Selection

Feature selection was also conducted using the random forest classifier with sequential feature selection. It selected five variables: Sex, ChestPainType, FastingBS, OldPeak, and ST_Slope. Sex, ChestPainType, and ST_Slope were also selected by the logistic regression model, adding evidence that these features are important in determining heart disease.

Support Vector Machines

The support vector machine performed just as well as the previous models. The only downside is that the predicted probabilities cluster near the decision threshold, meaning each prediction is not very confident. This model also took the longest to train without adding much accuracy.

SVC Feature Selection - Sequential Feature Selection

Feature selection keeps choosing the same 3-5 features, indicating that Sex, ChestPainType, MaxHR, OldPeak, and ST_Slope are important predictors of heart disease.

XGBoost

As stated previously, the dataset is imbalanced: 55% of patients had heart disease and 44% were normal. One benefit of XGBoost is that it has a parameter that takes the class imbalance into account, yielding a more balanced, generalizable model. As seen in the results, XGBoost achieves comparable performance metrics with a balanced ratio between correctly identified classes in the confusion matrix.

XGBClassifier Feature Selection - Sequential Feature Selection

The same five features are selected again by way of gradient boosting.

LightGBM

LightGBM is a classifier that was not introduced in class. It is a tree-based algorithm that focuses on efficiency and accuracy, and it was one of the most efficient models here, with the shortest training time. LightGBM also scores comparably to the other models and appears balanced between both classes, as shown in the confusion matrix and probability distribution.

LightGBM Feature Selection - Sequential Feature Selection

Four of the selected features overlap with the previous rounds of feature selection, further confirming across models that these are significant predictors.

AdaBoost

The AdaBoost classifier performed marginally worse than the other models, with lower accuracy and AUC scores. The probability distribution appears balanced, but many predictions lie near the threshold. It also appears to have high variance between folds, as shown in the 5-fold CV ROC plot.

AdaBoost Feature Selection - Sequential Feature Selection

The same 3-5 features are selected once again. However, the predictions achieved are much more varied.

CatBoost

CatBoost has been marginally better than the rest of the models. Its probability distribution is still skewed, however, and tends to favor the heart disease class.

CatBoost Feature Selection - Sequential Feature Selection

The same 3-5 features are selected once again. However, the predictions achieved are much more varied.

Summary of Results

A total of 42 models were built to predict heart disease in a patient, using 11 features composed of both categorical and continuous values. Two dataset samples were experimented with: one using one-hot encoding for the categorical features and the other using a Weight of Evidence transformation. Each data sample was trained in three ways: first, with default model parameters; second, with a randomized search for hyper-parameter tuning; and lastly, with feature selection.

What are your findings?

All the models attained test accuracy scores in the 81%-89% range and AUC values between roughly 0.80 and 0.88. Shown below are the summarized tables of results for all the models trained; Data1 represents the data sample using one-hot encoding, while Data2 represents the sample using the Weight of Evidence transformation. The 42 models span 7 different classifiers.

Unfortunately, no single model truly outperforms the others: all achieved very similar scores, with only marginal improvements between them. What stands out most is how the models lean when predicting either class. Nearly all were biased towards patients with heart disease, meaning a borderline case would most likely be classified as having heart disease. This is due to the imbalanced nature of the data, split roughly 55%/44% between having and not having heart disease.

The two most generalizable models were the random forest and XGBoost classifiers. They had the most balanced probability distributions and the most balanced metrics between classes. They did not have the highest accuracy or AUC (CatBoost did), but they had the most balanced performance, as shown by the F1 scores, confusion matrices, and probability distributions. The highest-performing model was CatBoost trained on Data2 (Weight of Evidence), but as stated before, it is biased towards the heart disease class. Since its results are only a marginal improvement over the other models, the final model would be either random forest or XGBoost, due to their generalizability and efficiency. Additionally, one-hot encoding and Weight of Evidence produced very similar results, with the Weight of Evidence sample performing marginally better; in other words, transforming the categorical variables with Weight of Evidence enabled a marginally better, more parsimonious model.
Lastly, another interesting finding was the constant overlap of selected features across models. In nearly every case the same 3-5 features were chosen by sequential feature selection: Sex, ChestPainType, FastingBS, OldPeak, and ST_Slope appear to be important predictors of heart disease.

If you had more time what do you think can be done further to improve the results?

If given more time, I would apply SMOTE or over/under-sampling of the data in order to train a more generalizable model. As stated before, the dataset is imbalanced, with a 55%/44% split between having and not having heart disease. Although it is not severely imbalanced, SMOTE or over/under-sampling will artificially create a balanced dataset, which will ultimately train a more generalizable model.

Default Models

| Model | Dataset | Train Accuracy | Test Accuracy | 5-Fold Test Accuracy | AUC | 5-Fold AUC |
|---|---|---|---|---|---|---|
| Logistic Regression | Data1 | 86.51% | 84.78% | 82.89% | 0.84 | 0.91 |
| Logistic Regression | Data2 | 87.47% | 85.33% | 82.35% | 0.85 | 0.91 |
| Random Forest | Data1 | 100% | 86.96% | 83.00% | 0.86 | 0.91 |
| Random Forest | Data2 | 100% | 85.33% | 83.69% | 0.85 | 0.90 |
| Support Vector Mch | Data1 | 87.47% | 85.33% | 83.00% | 0.85 | 0.91 |
| Support Vector Mch | Data2 | 87.06% | 85.33% | 82.02% | 0.85 | 0.90 |
| XGBoost | Data1 | 100% | 84.24% | 81.36% | 0.84 | 0.90 |
| XGBoost | Data2 | 100% | 82.61% | 82.02% | 0.82 | 0.90 |
| LightGBM | Data1 | 100% | 84.78% | 82.67% | 0.84 | 0.90 |
| LightGBM | Data2 | 100% | 83.70% | 81.58% | 0.83 | 0.90 |
| AdaBoost | Data1 | 88.96% | 82.61% | 81.59% | 0.83 | 0.87 |
| AdaBoost | Data2 | 89.37% | 83.15% | 80.28% | 0.83 | 0.87 |
| CatBoost | Data1 | 97.82% | 86.96% | 83.76% | 0.86 | 0.91 |
| CatBoost | Data2 | 97.55% | 85.87% | 83.87% | 0.85 | 0.91 |

Randomized Search Models

| Model | Dataset | Train Accuracy | Test Accuracy | 5-Fold Test Accuracy | AUC | 5-Fold AUC |
|---|---|---|---|---|---|---|
| Logistic Regression RS | Data1 | 86.38% | 86.96% | 82.35% | 0.86 | 0.91 |
| Logistic Regression RS | Data2 | 86.24% | 85.33% | 81.37% | 0.85 | 0.90 |
| Random Forest RS | Data1 | 92.10% | 88.59% | 83.98% | 0.88 | 0.91 |
| Random Forest RS | Data2 | 93.73% | 86.41% | 82.78% | 0.86 | 0.91 |
| Support Vector Mch RS | Data1 | 87.47% | 85.33% | 83.44% | 0.85 | 0.90 |
| Support Vector Mch RS | Data2 | 87.06% | 85.33% | 81.47% | 0.85 | 0.90 |
| XGBoost RS | Data1 | 92.51% | 86.96% | 83.11% | 0.86 | 0.89 |
| XGBoost RS | Data2 | 91.96% | 88.04% | 83.43% | 0.87 | 0.89 |
| LightGBM RS | Data1 | 94.14% | 85.87% | 82.35% | 0.85 | 0.89 |
| LightGBM RS | Data2 | 90.19% | 87.50% | 82.35% | 0.86 | 0.89 |
| AdaBoost RS | Data1 | 87.60% | 86.41% | 82.89% | 0.86 | 0.89 |
| AdaBoost RS | Data2 | 87.60% | 86.41% | 82.89% | 0.86 | 0.89 |
| CatBoost RS | Data1 | 97.28% | 86.96% | 83.54% | 0.86 | 0.91 |
| CatBoost RS | Data2 | 93.73% | 88.04% | 82.89% | 0.88 | 0.91 |

Feature Selected Models

| Model | Dataset | Train Accuracy | Test Accuracy | 5-Fold Test Accuracy | AUC | 5-Fold AUC |
|---|---|---|---|---|---|---|
| Logistic Regression FS | Data1 | 86.65% | 84.24% | 84.41% | 0.84 | 0.92 |
| Logistic Regression FS | Data2 | 85.97% | 84.24% | 83.98% | 0.84 | 0.91 |
| Random Forest FS | Data1 | 91.69% | 83.70% | 83.65% | 0.84 | 0.89 |
| Random Forest FS | Data2 | 86.51% | 82.61% | 83.65% | 0.83 | 0.89 |
| Support Vector Mch FS | Data1 | 85.69% | 83.15% | 84.20% | 0.83 | 0.92 |
| Support Vector Mch FS | Data2 | 81.47% | 80.98% | 79.07% | 0.80 | 0.90 |
| XGBoost FS | Data1 | 86.51% | 82.07% | 80.93% | 0.82 | 0.89 |
| XGBoost FS | Data2 | 91.28% | 82.61% | 82.99% | 0.82 | 0.89 |
| LightGBM FS | Data1 | 91.14% | 84.24% | 84.19% | 0.84 | 0.91 |
| LightGBM FS | Data2 | 90.74% | 83.15% | 83.54% | 0.83 | 0.90 |
| AdaBoost FS | Data1 | 86.24% | 83.70% | 82.13% | 0.83 | 0.91 |
| AdaBoost FS | Data2 | 86.10% | 83.70% | 85.17% | 0.83 | 0.91 |
| CatBoost FS | Data1 | 95.50% | 84.78% | 84.85% | 0.84 | 0.92 |
| CatBoost FS | Data2 | 90.46% | 85.33% | 84.96% | 0.85 | 0.91 |